
# README for replication materials for Goet, Niels, “Measuring Polarisation with Text Analysis: Evidence from the UK House of Commons, 1811-2015.”

This README file describes the contents of each folder in the replication data repository.

# SETUP AND BENCHMARKS
## OS
The original code was run on macOS 10.14 with 16 GB RAM and one 2.5 GHz CPU (4 cores).

## Benchmarks
Estimates of run times for the scripts to produce the results of the main analyses and of the supplementary materials are outlined below. 

All scripts listed below are called and executed through a single script, `execute_replication.R` (see details below). This script also installs all dependencies, provided you uncomment the relevant lines at its start. Please note that the R scripts support parallel processing, so replication time depends on how many cores you can allocate. For a standard laptop, I recommend using the default options.

### Main analysis
Run time per script (in order of replication): 

* `wordshoal_with_ca.R`: 62 minutes 
* `dimension_scaling.R`: 23 minutes
* `full_scaling.R`: 143 minutes
* `generate_matrices_ppr.py`: 464 minutes
* `SGD_CLASSIFIER.py`: 842 minutes
* `generate_figures.R`: 8 minutes

### On-line supplementary materials
Estimated run time per script (in order of replication): 
 
* `SGD_CLASSIFIER_IE_US.py` (for US Senate/Irish Dáil): 70.02 minutes (Senate); 23 minutes (Dáil)
* `NB_CLASSIFIER.py`: 760 minutes
* `WordShoalAnalysisUSSenate.R`: 16 minutes
* `compareCFscores.R`: 3 minutes
* `generate_figuresF1_F2_tableF2.R`: 4 minutes
* `compare_sgd_implementations_ps_vs_nppr.R` (Table E1): 0.35 minutes
* `compare_implementations_NB_vs_SGD.R` (Table D1): 8 minutes
* `generate_figures_appendix.R` (figures B1-B3, E1): 8 minutes
* `skpca_ps.R`: 634 minutes
* `per_debate_scaling.R`: 465 minutes
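
As a rough planning aid, the estimates above can be summed per section. The sketch below simply copies the listed minutes into Python; it assumes the scripts run one after another with no overlap, so actual wall-clock time will vary with hardware and core allocation.

```python
# Run-time estimates in minutes, copied from the two lists above.
main_analysis = {
    "wordshoal_with_ca.R": 62,
    "dimension_scaling.R": 23,
    "full_scaling.R": 143,
    "generate_matrices_ppr.py": 464,
    "SGD_CLASSIFIER.py": 842,
    "generate_figures.R": 8,
}
supplementary = {
    "SGD_CLASSIFIER_IE_US.py (Senate)": 70.02,
    "SGD_CLASSIFIER_IE_US.py (Dáil)": 23,
    "NB_CLASSIFIER.py": 760,
    "WordShoalAnalysisUSSenate.R": 16,
    "compareCFscores.R": 3,
    "generate_figuresF1_F2_tableF2.R": 4,
    "compare_sgd_implementations_ps_vs_nppr.R": 0.35,
    "compare_implementations_NB_vs_SGD.R": 8,
    "generate_figures_appendix.R": 8,
    "skpca_ps.R": 634,
    "per_debate_scaling.R": 465,
}


def total_hours(estimates):
    """Sum per-script minutes and convert the total to hours."""
    return sum(estimates.values()) / 60


print(f"Main analysis: {total_hours(main_analysis):.1f} hours")
print(f"Supplementary materials: {total_hours(supplementary):.1f} hours")
```

Under these assumptions, the main analysis takes roughly 25.7 hours and the supplementary materials roughly 33.2 hours.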

# ABBREVIATIONS

* `(n)ppr` = (no) procedural phrases removed
* `(n)ps` = (no) procedural phrases stripped (equivalent to `(n)ppr`)
* `NB` = (Multinomial) Naive Bayes
* `1b` = machine classifier implementation with party weights
* `1a` = machine classifier implementation without party weights
* `SGD` = stochastic gradient descent
* `ukhcdeb` = UK House of Commons Debates
* `WFM` = Word Frequency Matrix
* `CA` = Correspondence Analysis
* `SKPCA` = String Kernel Principal Components Analysis


# PYTHON SCRIPTS
The Python scripts in the replication repository are written in Python 3 and rely on the scikit-learn library. Where appropriate, instructions for running the scripts are given below, under the relevant folder headings.

Dependencies:

* pandas
* scikit-learn
* csv
* random
* math
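
Before starting a long run, it may help to confirm that these modules can be imported. The sketch below is one possible check, not part of the replication code: `missing_modules` is a helper name introduced here for illustration, and scikit-learn is looked up under its import name, `sklearn`. Note that `csv`, `random`, and `math` ship with the Python standard library, so only pandas and scikit-learn should need installing.

```python
import importlib.util

# Import names for the dependencies listed above.
# The scikit-learn package is imported as "sklearn".
REQUIRED = ["pandas", "sklearn", "csv", "random", "math"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All Python dependencies are available.")
```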


# R SCRIPTS
All R scripts are optimised to run in RStudio.

Dependencies:

* Rcpp
* tm
* data.table
* gridExtra
* slam
* base
* RcppArmadillo
* boot
* rlist
* ggplot2
* kernlab
* austin
* dplyr
* methods
* ggrepel
* psych
* Hmisc
* plyr
* pacman
* ca
* KernSmooth
* reshape
* tidyr
* reshape2
* stringr
* inline
* stats
* MASS
* rjags
* quanteda

# REPLICATION PROCESS
Prior to running the replication process, please unzip all folders in the repository. The folder structure must be maintained for the replication to succeed. The analysis relies on a combination of R and Python scripts, but the complete replication can be achieved by running the `execute_replication.R` script. This script runs the required parts of the code in the correct order to obtain the results reported in the paper.

Each file, figure, and table generated by the code is logged in `ukhcpol_logfile.log`. I recommend tailing the log file to keep track of progress, especially because generating the sparse matrices for the machine classifier takes a considerable amount of time (see the estimates for each script under "SETUP AND BENCHMARKS"). Please also note that the matrices take up a lot of hard disk space (approximately 20 GB). I can share the sparse matrices directly upon request.
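
On systems without a `tail` command, progress can be checked with a few lines of Python. The following is a minimal sketch that assumes the log file sits in the current working directory; `last_lines` is a hypothetical helper, not part of the replication code.

```python
from collections import deque
from pathlib import Path

# Assumed location: the current working directory.
LOG_FILE = Path("ukhcpol_logfile.log")


def last_lines(path, n=10):
    """Return the last n lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


if __name__ == "__main__":
    if LOG_FILE.exists():
        print("\n".join(last_lines(LOG_FILE)))
    else:
        print(f"{LOG_FILE} not found; has the replication started?")
```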

Details of the individual scripts/codes and subfolder structure are provided in the sections below. 

Replication steps:
1. Unzip all compressed folders and files.
2. Run the `execute_replication.R` script from RStudio.
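
The first step can be combined with a quick check that the folder structure is intact. The sketch below is an illustration only: it assumes the compressed folders are `.zip` archives located directly in the repository root and that the four folders described under FOLDERS sit at the top level. `unzip_all` and `missing_folders` are hypothetical helper names, and archives nested inside sub-folders are not handled.

```python
import zipfile
from pathlib import Path

# Top-level folders described in this README (assumed to sit in the root).
EXPECTED_FOLDERS = [
    "raw_data",
    "machine_learning_implementations",
    "poisson_model_implementations",
    "online_appendix",
]


def unzip_all(root):
    """Extract every .zip archive found directly in root, in place."""
    root = Path(root)
    for archive in sorted(root.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(root)


def missing_folders(root):
    """Return the expected folders that do not exist under root."""
    root = Path(root)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    unzip_all(".")
    missing = missing_folders(".")
    print("Missing folders:", missing if missing else "none")
```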

# FOLDERS
## raw\_data
Contains the following files:
	
* `brs_data.csv` (data from the 1992 wave of the British Candidate Survey (BCS; Norris and Lovenduski 1995), used to generate Figure 4).
* `cmp_polarisation.csv` (data from the Comparative Manifesto Project (CMP; Volkens et al. 2016), used to generate Figure 5).
* `sessions_data.csv` (data on session dates of the UK HoC, own compilation).
* `ukhcdeb_nppr.csv` (UK House of Commons Debates Data as described in the paper, where procedural phrases have not been excluded).
* `ukhcdeb_ppr.csv` (UK House of Commons Debates Data as described in the paper, where procedural phrases have been excluded).

## machine\_learning\_implementations

This folder contains all scripts and data to implement the machine learning algorithms to measure polarisation on the basis of classification accuracy.

`NB_CLASSIFIER.py` and `SGD_CLASSIFIER.py` implement the NB and SGD classifiers, respectively, as described in the paper. The scripts are written in Python 3 and require the scikit-learn library and all other dependencies outlined above. The output from the classifiers is included in the following folders:

* `1a_estimates_nps`
* `1a_estimates_nps_NB`
* `1a_estimates_ps`
* `1a_estimates_ps_NB`
* `1b_estimates_nps`
* `1b_estimates_nps_NB`
* `1b_estimates_ps`
* `1b_estimates_ps_NB`
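
The folder names combine the abbreviations defined earlier: the implementation (`1a` or `1b`), the treatment of procedural phrases (`nps` or `ps`), and an `_NB` suffix for Naive Bayes output. As a sketch of this naming scheme, the list above can be generated as follows. Reading the folders without a suffix as SGD output is an inference from the names, consistent with the SGD scores file being stored in `1b_estimates_ps`.

```python
from itertools import product

implementations = ["1a", "1b"]  # without / with party weights
treatments = ["nps", "ps"]      # procedural phrases not stripped / stripped
suffixes = ["", "_NB"]          # "" = SGD output (inferred), "_NB" = Naive Bayes

# Build every combination in the same order as the list above.
folders = [
    f"{impl}_estimates_{treat}{suffix}"
    for impl, treat, suffix in product(implementations, treatments, suffixes)
]
print("\n".join(folders))
```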

`sgd_scores_all_parties_1810_20151b.csv` contains the sessional accuracy levels of the SGD classifier with party weights (the same file is also contained in the `1b_estimates_ps` folder).

## poisson\_model\_implementations

This folder contains all scripts and data to implement the Poisson scaling algorithms. All important output files used to generate the figures are contained both in the main folder and in the sub-folder in which the script generates them.

### subfolder: 1.full\_scaling

* `full_scaling.R` 				= R script to implement Wordfish after aggregation of speeches by legislator per session, as described in the paper.
* `full_scaling_estimates_ps.csv` 	 	= Output from the `full_scaling.R` script
* `wfms` folder = contains WFMs used in this estimation approach


### subfolder: 2.dimension\_scaling

* `4.dimension_scaling.R`				= implements the scaling on the basis of the data obtained.
* `dimension_scaling_estimates75ps.csv`		= contains the scaling estimates where the threshold for inclusion as an economic speech is a .75 probability of falling in that class. 
* `dimension_scaling_estimates99ps.csv`		= contains the scaling estimates where the threshold for inclusion as an economic speech is a .99 probability of falling in that class. 
* `wfms0.75` = folder containing WFMs used in the dimension scaling approach with a threshold of 0.75 (approach described in paper) 
* `wfms0.99` = folder containing WFMs used in the dimension scaling approach with a threshold of 0.99 (approach described in paper) 
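
To make the two thresholds concrete: a speech enters the economic-dimension scaling only if its estimated probability of belonging to the economic class reaches the cut-off. The sketch below illustrates such a filter on made-up values. The speech identifiers, the probabilities, the function name `select_economic`, and the use of `>=` rather than `>` are all assumptions for illustration and are not taken from the replication scripts.

```python
def select_economic(speech_probs, threshold):
    """Keep the ids of speeches whose economic-class probability meets the threshold."""
    return [sid for sid, prob in speech_probs.items() if prob >= threshold]


# Hypothetical classifier output: speech id -> P(economic).
speech_probs = {"s1": 0.995, "s2": 0.80, "s3": 0.60}

print(select_economic(speech_probs, 0.75))  # ['s1', 's2']
print(select_economic(speech_probs, 0.99))  # ['s1']
```

The stricter 0.99 cut-off retains fewer speeches, which is why the two thresholds produce separate sets of WFMs and estimates.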


### subfolder: 3.wordshoal
* `wfms_ps` folder 			= contains the WFMs used in the Wordshoal estimation
* `wordshoal_with_ca_ps.R`		= Lauderdale and Herzog’s Wordshoal script, with some minor modifications to generate correspondence analysis estimates in the same loop. 
* `per_debate_scaling.R`		= Script to estimate Wordfish at the debate level.
* `debate_level_estimates_ps.csv`	= Output from the `per_debate_scaling.R` script.

*Note*: `Wordshoal9.R` in the `wfms_ps` folder is from Lauderdale and Herzog (2016) (cf. Lauderdale, Benjamin E., and Herzog, Alexander. 2016. Replication data for: Measuring political positions from legislative speech. **Harvard Dataverse**. http://dx.doi.org/10.7910/DVN/RQMIV3).

## online\_appendix
Contains all data and scripts for the model/estimations/figures contained in the on-line supplementary materials.

*Note*: Where indicated, to replicate the results for Wordshoal and to run machine classification for the US Senate and Irish Dáil, data and scripts have been adapted from Lauderdale and Herzog (2016) (cf. Lauderdale, Benjamin E., and Herzog, Alexander. 2016. Replication data for: Measuring political positions from legislative speech. **Harvard Dataverse**. http://dx.doi.org/10.7910/DVN/RQMIV3).

### subfolder: data
This subfolder contains results from the models applied to the data where procedural phrases are included. As noted above, these estimates are obtained using the same scripts as for the main application in the paper, but are based on the `ukhcdeb_nppr.csv` file.

* `wfms_nppr` = contains WFMs for Wordshoal (procedural phrases not removed)
* `UK_wordshoal_nps.csv`	= contains Wordshoal estimates (procedural phrases not removed)
* `UK_wordfish_estimates_nppr.csv` = contains Wordfish estimates (procedural phrases not removed)

### subfolder: figures
Contains the `generate_figures_appendix.R` script, which contains code to create Figures A1, A2, and A3.

Also contains the `generate_figuresF1_F2_tableF2.R` script, which generates `figureF1a_29th_Dail.pdf`, `figureF1a_30th_Dail.pdf`, `FigureF2.pdf`, and Table F2. (Figure F3 is generated by the `compareCFscores.R` script; see below.)

### subfolder: skpca
* `debate_data` = folder that contains .RData files for the skpca estimation 
* `skpca_ps.R` 			= script to estimate string kernel principal components analysis.
* `results1810_2015skpca_ps.csv`  	= Contains the estimates obtained with the `skpca_ps.R` script.

### subfolder: tables
* `tableF2`: folder where Table F2 is saved from the `generate_figuresF1_F2_tableF2.R` script
* `tableD1`: folder that contains the `compare_implementations_NB_vs_SGD.R` script to generate Table D1
* `tableE2`: folder that contains the `compare_sgd_implementations_ps_vs_nppr.R` script to generate Table E1

### subfolder: ReplicationLauderdaleHerzog2016
Contains sub-folders with scripts and data to replicate the machine classifier analyses for the Irish Dáil and the US Senate.
	
* `Analysis` = folder that contains:
	* `WordShoalAnalysisUSSenate.R`: script to run Wordshoal on US Senate data (based on Lauderdale and Herzog)
	* `compareCFscores.R` script as well as .ord files to generate Figure F3
* `Estimates` = folder that contains Wordshoal estimates from Lauderdale and Herzog's replication data, which are needed in the generation of the figures
* `Wordshoal9.R` = Lauderdale and Herzog's Wordshoal software
* `Processed_speeches` = folder that contains Lauderdale and Herzog's Dáil and US Senate speech data
* `machine_classification` = folder that contains: 

	 * `SGD_CLASSIFIER_IE_US.py`: script that performs the machine classification
	 * `matrices`: a sub-folder with sparse matrices for the US Senate and Dáil data to which the machine classifier is applied
	 * `1b_estimates`: a sub-folder in which the estimates from the machine classifier are stored
	 * tab-delimited files that contain US Senate and Dáil speeches.
